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Abstract Coordinate descent algorithms solve optimization problems by suc¬ 
cessively performing approximate minimization along coordinate directions or 
coordinate hyperplanes. They have been used in applications for many years, 
and their popularity continues to grow because of their usefulness in data anal¬ 
ysis, machine learning, and other areas of current interest. This paper describes 
the fundamentals of the coordinate descent approach, together with variants 
and extensions and their convergence properties, mostly with reference to con¬ 
vex objectives. We pay particular attention to a certain problem structure that 
arises frequently in machine learning applications, showing that efficient im¬ 
plementations of accelerated coordinate descent algorithms are possible for 
problems of this type. We also present some parallel variants and discuss their 
convergence properties under several models of parallel execution. 

Keywords coordinate descent • randomized algorithms • parallel numerical 
computing 


1 Introduction 

Coordinate descent (CD) algorithms for optimization have a history that dates 
to the foundation of the discipline. They are iterative methods in which each 
iterate is obtained by fixing most components of the variable vector x at their 
values from the current iteration, and approximately minimizing the objective 
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with respect to the remaining components. Each such subproblem is a lower¬ 
dimensional (even scalar) minimization problem, and thus can typically be 
solved more easily than the full problem. 

CD methods are the archetype of an almost universal approach to algo¬ 
rithmic optimization: solving an optimization problem by solving a sequence 
of simpler optimization problems. The obviousness of the CD approach and 
its acceptable performance in many situations probably account for its long¬ 
standing appeal among practitioners. Paradoxically, the apparent lack of so¬ 
phistication may also account for its unpopularity as a subject for investigation 
by optimization researchers, who have usually been quick to suggest alterna¬ 
tive approaches in any given situation. There are some very notable exceptions. 
The 1970 text of Ortega and Rheinboldt [40] Section 14.6] included a compre¬ 
hensive discussion of “univariate relaxation,” and such optimization specialists 
as Luo and Tseng I2QU2H, Tseng [55], and Bertsekas and Tsitsiklis [5] made 
important contributions to understanding the convergence properties of these 
methods in the 1980s and 1990s. 

The situation has changed in recent years. Various applications (includ¬ 
ing several in computational statistics and machine learning) have yielded 
problems for which CD approaches are competitive in performance with more 
reputable alternatives. The properties of these problems (for example, the low 
cost of calculating one component of the gradient, and the need for solutions 
of only modest accuracy) lend themselves well to efficient implementations of 
CD, and CD methods can be adapted well to handle such special features 
of these applications as nonsmooth regularization terms and a small number 
of equality constraints. At the same time, there have been improvements in 
the algorithms themselves and in our understanding of them. Besides their 
extension to handle the features just mentioned, new variants that make use 
of randomization and acceleration have been introduced. Parallel implemen¬ 
tations that lend themselves well to modern computer architectures have been 
implemented and analyzed. Perhaps most surprisingly, these developments are 
relevant even to the most fundamental problem in numerical computation: 
solving the linear equations Aw = b. 

In the remainder of this section, we state the problem types for which 
CD methods have been developed, and sketch the most fundamental versions 
of CD. Section [5] surveys applications both historical and modern. Section [3] 
sketches the types of algorithms that have been implemented and analyzed, 
and presents several representative convergence results. Section [4] focuses on 
parallel CD methods, describing the behavior of these methods under syn¬ 
chronous and asynchronous models of computation. 

Our approach throughout is to describe the CD methods in their simplest 
forms, to illustrate the fundamentals of the applications, implementations, and 
analysis. We focus almost exclusively on methods that adjust just one coordi¬ 
nate on each iteration. Most applications use block coordinate descent meth¬ 
ods, which adjust groups of blocks of indices at each iteration, thus searching 
along a coordinate hyperplane rather than a single coordinate direction. Most 
derivation and analysis of single-coordinate descent methods can be extended 
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without great difficulty to the block-CD setting; the concepts do not change 
fundamentally. We mention too that much effort has been devoted to devel¬ 
oping more general forms of CD algorithms and analysis, involving weighted 
norms and other features, that allow more flexible implementation and al¬ 
low the proof of stronger and more general (though usually not qualitatively 
different) convergence results. 


1.1 Formulations 

The problem considered in most of this paper is the following unconstrained 
minimization problem: 

min f(x), (1) 

X 

where / : —> K. is continuous. Different variants of CD make further as¬ 

sumptions about /. Sometimes it is assumed to be smooth and convex, some¬ 
times smooth and possibly nonconvex, and sometimes smooth but with a re¬ 
stricted domain. (We will make such assumptions clear in each discussion of 
algorithmic variants and convergence results.) 

Motivated by recent popular applications, it is common to consider the 
following structured formulation: 

min h{x ) := /( x) + Al7(;r), (2) 

X 

where / is smooth, 17 is a regularization function that may be nonsmooth and 
extended-valued, and A > 0 is a regularization parameter. 17 is often convex 
and usually assumed to be separable or block-separable. When separable, 17 
has the form 

n 

x ) = y(3) 

2=1 

where 17; : R —>■ M for all i. The best known examples of separability are the 
fd-norm (in which I7(x) = ||x||i and hence f2i(xi) = \xt\) and box constraints 
(in which l7;(:r;) = I[i iiUi \{.Xi) is the indicator function for the interval [/;,it;]). 
Block separability means that the n x n identity matrix can be partitioned 
into column submatrices Ld, i = 1,2,..., N such that 

N 

n(x) = '£ i n i (uTx). (4) 

2=1 

Block-separable examples include group-sparse regularizers in which 17; (z;) := 
|| Zi || 2 - Formulations of the type ([2]), with separable or block-separable regu¬ 
larizers, arise in such applications as compressed sensing, statistical variable 
selection, and model selection. 
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The class of problems known as empirical risk minimization (ERM) gives 
rise to a formulation that is particularly amenable to coordinate descent; see 
[52] . These problems have the form 


. 1 
mm — 
n 


n 

£*(<?«,) + A g(w), 

i =1 


(5) 


for vectors Ci £ M d , i = 1,2,... ,n and convex functions <f>i, i = 1,2,... ,n 
and g. We can express linear least-squares, logistic regression, support vec¬ 
tor machines, and other problems in this framework. Recalling the following 
definition of the conjugate t* of a convex function t: 

t*{y) =sup(z T y-t{z)), (6) 

Z 

we can write the Fenchel dual [35] Section 31] of ([5]) as follows: 

^n^ { - Xi) + X9 *(~L Cx )' (?) 

i—1 v 7 

where C is the d x n matrix whose columns are Cj, i = 1,2,..., n. The dual 
formulation 0 is has special appeal as a target for coordinate descent, because 
of separability of the summation term. 

One interesting case is the system of linear equations 


Aw = b, where A G R mxn , 


( 8 ) 


which we assume to be a feasible system. The least-norm solution is found by 
solving 

min — ||t^||| subject to Aw = b, (9) 

w£M. n 2 

whose Lagrangian dual is 

min f{x) := ^||v4 T a;||i - b T x. (10) 


(We recover the primal solution from (10) by setting w = 
that (101 is a special case of the Fenchel dual Q obtained 


A T x.) We can see 
from ([5]) if we set 


C «- A T , g{w) 


1 

2 


HI 


2 

2’ 




I{bi} (^i)i 


A = 1/n, 


where I{bi} denotes the indicator function for bi, which is zero at bi and infinite 
elsewhere. (Its conjugate is JA i(s,) = b,s,.) The primal problem (|9j) can be 
restated correspondingly as 


min 

iueR" 


1 1 

~ ^2 T {bi}( A i w ) + -IHlt 

m z ' n 


where Ai denotes the ith row of the matrix A in Q, which has the form 0- 
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1.2 Outline of Coordinate Descent Algorithms 

The basic coordinate descent framework for continuously differentiable mini¬ 
mization is shown in Algorithm [l] Each step consists of evaluation of a single 
component 4 of the gradient V/ at the current point, followed by adjustment 
of the ik component of x , in the opposite direction to this gradient compo¬ 
nent. (Here and throughout, we use [V/(x)]j to denote the *th component of 
the gradient V/(x).) There is much scope for variation within this framework. 
The components can be selected in a cyclic fashion, in which = 1 and 

4+i = [4 mod n] + 1, fe = 0,1,2,... . (11) 

They can be required to satisfy an “essentially cyclic” condition, in which for 
some T > n, each component is modified at least once in every stretch of T 
iterations, that is, 

Uj =0 {4-j} = {1,2, . ..,n}, for all k > T. (12) 

Alternatively, they can be selected randomly at each iteration (though not 
necessarily with equal probability). Turning to steplength we may per¬ 
form exact minimization along the 4 component, or choose a value of otk 
that satisfies traditional line-search conditions (such as sufficient decrease), or 
make a predefined “short-step” choice of ak based on prior knowledge of the 
properties of /. 


Algorithm 1 Coordinate Descent for 0 _ 

Set k «— 0 and choose x° E 

repeat 

Choose index i & E {1, 2,... ,n}; 

x k + 1 4 — x k — Qfc[V f(x k )]i k e{ k for some a & > 0; 

k 4 — k 1; 

until termination test satisfied; 


The CD framework for the separable regularized problem is shown 

in Algorithm [2] At iteration fc, a scalar subproblem is formed by making a 
linear approximation to / along the 4 coordinate direction at the current 
iterate x k , adding a quadratic damping term weighted by l/a*, (where ak 
plays the role of a steplength), and treating the relevant regularization term 
explicitly. Note that when the regularizer f2i is not present, the step is 
identical to the one taken in Algorithm [lj For some interesting choices of 17,; 
(for example 124 ') = | • |), it is possible to write down a closed-form solution 
of the subproblem; no explicit search is needed. The operation of solving such 
subproblems is often referred to as a “shrink operation,” which we denote by 
S /3 and define as follows: 

Sp ( T ) : = min ||x — ' r ll 2 + ^i(x)- (13) 

X Zp 
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By stating the subproblem in Algorithm [2] equivalently as 

min ||x - {Xi k - a k [Vf{x k )\i Jf + ^(x), 
we can express the CD update as z k <— S\a k {x k — ct/c[V f {x k )]i k ). 


Algorithm 2 Coordinate Descent for MM _ 

Set k <— 0 and choose x° G M n ; 

repeat 

Choose index i & G {1, 2,... ,n}; 

«- argmin x (x - x^ k ) T [V f{x k )] ik + ^Wx ~ x\ k III + ADi(x) for some cr fe > 0; 
Ai ^— k -\- 1; 

until termination test satisfied; 


Algorithms [l] and [2] can be extended to block-CD algorithms in a straight¬ 
forward way, by updating a block of coordinates (denoted by the column 
submatrix Ui k of the identity matrix) rather than a single coordinate. In 
Algorithm [2j it is assumed that the choice of block is consistent with the 
block-separable structure of the regularization function C, that is, U lk is a 
concatenation of several of the submatrices Ui in ©■ 


1.3 Application to Linear Equations 


For the formulation (10) that arises from the linear system Aw = b , let us 
assume that the rows of A are normalized, that is, 


||Aj|| 2 = l for i = 1,2,... ,m. 

Applying Algorithm [I] to (101 with a*, = 1, each step has the form 


A+i 


~(A ik A x ~b ik )e ik . 


(14) 


(15) 


If we maintain and update the estimate w k of the solution to the primal 
problem ([ 9 ]) after each update of x k , according to w k = A T x k , we obtain 

w k+1 <- w k - ( A ik A r x k - b ik )Af k =w k - ( A ik w k - b ih )Af h , (16) 


which is the update formula for the Kaczmarz algorithm 22]. Following this 


update, we have using (14) that 


A lk w k+1 = A ik w k 


(Ai k w bi k ) — bi k , 


so that the ik equation in the system Aw = b is now satisfied. This method if 
sometimes known as the “method of successive projections” because it projects 
onto the feasible hyperplane for a single constraint at every iteration. 
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1.4 Relationship to Other Methods 


Stochastic gradient (SG) methods, also undergoing a revival of interest because 
of their usefulness in data analysis and machine learning applications, minimize 
a smooth function / by taking a (negative) step along an estimate g k of the 
gradient V/(cc fc ) at iteration k. It is often assumed that g k is an unbiased 
estimate of X7f(x k ), that is, V/( x k ) = E(g k ), where the expectation is taken 
over whatever random variables were used in obtaining g k from the current 
iterate x k . Randomized CD algorithms can be viewed as a special case of 
SG methods, in which g k = n[V/(a; fc )]j t ej fc , where ik is chosen uniformly at 
random from {1,2,..., n}. Here, ik is the random variable, and we have 

n 

E(g k ) = ~ E «[V/(**)]iei = y.f(x k ), 


certifying unbiasedness. However, CD algorithms have the advantage over gen¬ 
eral SG methods that descent in / can be guaranteed at every iteration. More¬ 
over, the variance of the gradient estimate g k shrinks to zero as the iterates 
converge to a solution x *, since every component of V/(x*) is zero. By con¬ 
trast, in general SG methods, the gradient estimates g k may be nonzero even 
when x k is a solution. 

The relationship between CD and SG methods can also be discerned from 
the Fenchel dual pair © and 0 - SG methods are quite popular for solving 
formulation 0, where the estimate g k is obtained by taking a single term ik 
from the summation and using {cf k w)ci k as the estimate of the gradient 
of the full summation. This approach corresponds to applying CD to the dual 
0, where the component ik of x is selected for updating at iteration k. This 
relationship is typified by the Kaczmarz algorithm for Aw = b , which can be 
derived either as CD applied to the dual formulation (101 or as SG applied to 
the sum-of-squares problem 


1 1 

min -||Aw - b\\l = - A i w ~ h i) 2 - 
w 2 2 zJ 

2=1 


(17) 


CD is related in an obvious way to the Gauss-Seidel method for n x n sys¬ 
tems of linear equations, which adjusts the ik variable to ensure satisfaction 
of the ik equation, at iteration k. (Successive over-relaxation (SOR) modifies 
this approach by scaling each Gauss-Seidel step by a factor (1 + w) for some 
constan w S [0,1), chosen so as to improve the convergence rate.) Standard 
Gauss-Seidel and SOR use the cyclic choice of coordinates ([10, whereas a ran¬ 
dom choice of ik would correspond to “randomized” versions of these methods. 
To make the connections more explicit: The Gauss-Seidel method applied to 
the normal equations for 0 — that is, A T Aw = A T b — is equivalent to apply¬ 
ing Algorithm [I] to the least-squares problem p0 , when the steplength ak is 
chosen to minimize the objective exactly along the given coordinate direction. 
SOR also corresponds to Algorithm 0 with ak chosen to be a factor (1 + w) 
times the exact minimum. These equivalences allow the results of Section [0 
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to be used to derive convergence rates for Gauss-Seidel applied to the normal 
equations, including linear convergence when A T A is nonsingular. Note that 
these results do not require feasibility of the original equations ©■ 


2 Applications 

We mention here several applications of CD methods to practical problems, 
some dating back decades and others relatively new. Our list is necessarily in¬ 
complete, but it attests to the popularity of CD in a wide variety of application 
communities. 

Bournan and Sauer IT] discuss an application to positron emission tomog¬ 
raphy (PET) in which the objective has the form <§ where / is smooth and 
convex and 17 is a sum of terms of the form \xj — Xi q for some pairs of com¬ 
ponents (j, l ) of x and some q £ [1, 2]. Ye et al. [57] apply a similar method to 
a different objective arising from optical diffusion tomography. 

Liu, Paratucco, and Zhang [26j describe a block CD approach for linear 
least squares plus a regularization function consisting of a sum of norms 
of subvectors of x. The technique is applied to semantic basis discovery, which 
learns from data how to identify and classify the functional MRI response of 
a person’s brain when they hear certain English words. 

Canutescu and Dunbrack HU describe a cyclic coordinate descent method 
for determining protein structure, adjusting the dihedral angles in a protein 
chain so that the atom at the end of the chain comes close to a specified 
position in space. 

Florian and Chen HU recover origin-destination matrices from observed 
traffic flows by alternately solving a bilevel optimization problem over two 
blocks of variables: the origin-destination demands and the proportion of each 
origin-destination flow assigned to each arc in the network. 

Breheny and Huang m discuss coordinate descent for linear and logistic 
regression with nonconvex separable regularization terms, reporting results for 
genetic association and gene expression studies. The SparseNet algorithm [33] 
applied to problems with these same nonconvex separable regularizers uses 
warm-started cyclic coordinate descent as an inner loop to solve a sequence of 
problems in which the regularization parameter A in and the parameters 
defining concavity of the regularization functions are varied. 

Friedman, Hastie, and Tibshirani [TS] propose a block CD algorithm for 
estimating a sparse inverse covariance matrix, given a sample covariance ma¬ 
trix S and taking the variable in their formulation to be a modification W of 
S, such that W~ x is sparse. The resulting “graphical lasso” algorithm cycles 
through the rows/columns of W (in the style of block CD), solving a standard 
lasso problem to calculate each update. The same authors HU apply CD to 
generalized linear models such as linear least squares and logistic regression, 
with convex regularization terms. Their framework include such formulations 
as lasso, graphical lasso, elastic net, and the Dantzig selector, and is imple¬ 
mented in the package glmnet. 




Coordinate Descent Algorithms 


9 


Chang, Hsieli, and Lin [T2] use cyclic and stochastic CD to solve a squared- 
loss formulation of the support vector machine (SVM) problem in machine 
learning, that is, 


m 


min 

w 


max(l 



(18) 


where (xi,yt) G M. N x {0,1} are feature vector / label pairs and A is a reg¬ 
ularization parameter. This problem is an important instance of the ERM 
form ([5]). In the best known early application of coordinate descent to SVM, 
Platt p] deals with a hinge-loss formulation of SVM, which is identical to 
(18) except that the square on each term of the summation is omitted. The 
dual of this problem has bounds on its variables along with a single linear 
constraint. Platt’s procedure SMO (for “sequential minimal optimization”), 
applied to the dual, changes two variables at a time, with the variable pair 
chosen according to a “greedy” criterion and the search direction chosen to 
maintain feasibility of the linear constraint. 

Sardy, Bruce, and Tseng [5D] consider the basis-pursuit formulation of 
wavelet denoising: 

min -\\$x-y\\l +A||a:|| 1 . 


This formulation is equivalent to the well known lasso of Tibshirani [53] and has 
become famous because of its applicability to sparse recovery and compressed 
sensing. Although this formulation fits the ERM framework ([ 5 ]) and could thus 
be dualized before applying CD, the approach of [50] applies block CD directly 
to the primal formulation. 

Applications of block CD approaches to transceiver design for cellular net¬ 
works and to tensor factorization are discussed in Razaviyayn [451 Section 8]. 

Finally, we mention several popular problem classes and algorithms that 
can be interpreted as CD algorithms, but for which such an interpretation may 
not be particularly helpful in understanding the performance of the algorithm. 
First, we consider low-rank matrix completion problems in which we are pre¬ 
sented with limited information about a rectangular matrix M € R mx " and 
seek matrices U G R raxr ' and V G M mxr (with r small) such that UV T is con¬ 
sistent with the observations of M. When the observations satisfy a restricted 
isometry property (an assumption commonly made in compressed sensing; see 
[T51 Definition 3.1] for a definition that applies to matrix completion), the block 
CD approach of Jain, Netrapalli, and Sanghavi eh Algorithm 1] converges to 
a solution. This approach defines the objective to be the least-squares fit be¬ 
tween the observations and their predicted values according to the product 
UV T — a function that is nonconvex with respect to ( U , V ) — and minimizes 
alternately over U and V, respectively. Standard analysis of CD for nonconvex 
functions would yield at best stationarity of accumulation points, but much 
stronger results are attained in [21] because of special assumptions that are 
made on the problem in this paper. 
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Second, we consider the “alternating-direction method of multipliers” (ADMM) 
[HE], which has gained great currency in the past few years because of its 
usefulness in solving regularized problems in statistics and machine learning, 
and in designing parallel algorithms. Each major iteration of ADMM consists 
of an (approximate) minimization of the augmented Lagrangian function for a 
constrained optimization problem over each block of primal variables in turn, 
followed by an update to the Lagrange multiplier estimates. It might seem 
appealing to do multiple cycles of updating the primal variable blocks, in the 
manner of cyclic block CD, thus finding a better approximation to the so¬ 
lution of each subproblem over all primal variables and moving the method 
closer to the standard augmented Lagrangian approach. Eckstein and Yao eh) 
show, however, that this “approximate augmented Lagrangian” approach has 
a fundamentally different theoretical interpretation from ADMM, and a com¬ 
putational comparison between the two approaches Section 5] appears to 
show an advantage for ADMM. 


3 Coordinate Descent: Algorithms, Convergence, Implementations 

We now describe the most important variants of coordinate descent and present 
their convergence properties, including the proofs of some fundamental results. 
We also discuss the implementation of accelerated CD methods for problems 
of the form 0 and for the Kaczmarz algorithm for Aw = b. As mentioned 
in the introduction, we deal with the most elementary framework possible, to 
expose the essential properties of the methods. 


3.1 Powell’s Example 

We start with a simple but intriguing example due to Powell (44] formula (2)] 
of a function in R 3 for which cyclic CD fails to converge to a stationary point. 
The nonconvex, continuously differentiable function / : R 3 —M is defined as 
follows: 

3 

f(x !,X 2 ,X 3 ) = —(xiX2 + x 2 x 3 + XiX 3 ) + ^(|Xi| - 1) + . (19) 

i= 1 

It has minimizers at the corners (1,1,1) T and (—1, —1, — 1) T of the unit cube, 
but coordinate descent with exact minimization, started near (but just outside 
of) one of the other vertices of the cube cycles around the neighborhoods of 
six points that are close to the six non-optimal vertices. Powell shows that 
the cyclic nonconvergence behavior is rather special and is destroyed by small 
perturbations on this particular example, and we can note that a randomized 
coordinate descent method applied to this example would be expected to con¬ 
verge to the vicinity of a solution within a few steps. Still, this example and 
others in (441 make it clear that we cannot expect a general convergence result 
for nonconvex functions, of the type that are available for full-gradient descent. 
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Fig. 1 Example of Powell I44| showing nonconvergence of cyclic coordinate descent. 


Results are available for the nonconvex case under certain additional assump- 
tions that still admit interesting applications. Bertsekas [11 Proposition 2.7.1] 
describes convergence of a cyclic approach applied to nonconvex problems, 
under the assumption that the minimizer along any coordinate direction from 
any point x is unique. More recent work m focuses on CD with two blocks of 
variables, applied to functions that satisfy the so-called Kurdyka-Lojasiewicz 
(KL) property, such as semi-algebraic functions. Convergence of subsequences 
or the full sequence {x fc } to stationary points can be proved in this setting. 


3.2 Assumptions and Notation 

For most of this section, we focus on the unconstrained problem 0 , where 
the objective / is convex and Lipschitz continuously differentiable. In some 
places, we assume strong convexity with respect to the Euclidean norm, that 
is, existence of a modulus of convexity a > 0 such that 

f(y) > f(x)+ Vf{x) T (y - x)+ ^\\y - x\\l, for all x, y. (20) 

(Henceforth, we use || • || to denote the Euclidean norm || ■ || 2 , unless otherwise 
specified.) We define Lipschitz constants that are tied to the component direc¬ 
tions, and are key to the algorithms and their analysis. The first set of such 
constants are the component Lipschitz constants , which are positive quantities 
Li such that for all x £ and all t £ R we have 


|[V/(x + tei)]i - [V/(x)],| < L t \t\ 


(21) 
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We define the coordinate Lipschitz constant L max to be such that 

^max = max Z/j. (22) 

i=l,2,...,rs 

The standard Lipschitz constant L is such that 

\\X7f(x + d)~X7f(x)\\<L\\d\l (23) 

for all x and d of interest. By referring to relationships between norm and 
trace of a symmetric matrix, we can assume that 1 < L/L max < n. (The 
upper bound is achieved when f(x) = e(e T x ), for e = (1,1,..., 1) T .) We also 
define the restricted Lipschitz constant L les such that the following property 
is true for all x £ R", all t £ M, and all* = 1, 2,..., n: 

\\Vf(x + t ei )-Vf(x)\\<L res \t\. (24) 

Clearly, L les < L. The ratio 


A. .— L res /L max (25) 

is important in our analysis of asynchronous parallel algorithms in Section]!] In 
the case of / convex and twice continuously differentiable, we have by positive 
semidefiniteness of the V 2 f{x) at all x that 

I[V 2 /0r)y < ([V 2 /(*)]«[V 2 /(*)]«) 1/2 , 

from which we can deduce that 


1 < A < ffn. 

However, we can derive stronger bounds on A for functions / in which the 
coupling between components of x is weak. In the extreme case in which / 
is separable, we have A = 1. The coordinate Lipschitz constant corresponds 
L max to the maximal absolute value of the diagonal elements of the Hessian 
V 2 /(cc), while the restricted Lipschitz constant L res is related to the maximal 
column norm of the Hessian. Therefore, if the Hessian is positive semidefinite 
and diagonally dominant, the ratio A is at most 2. 

The following assumption is useful in the remainder of the paper. 

Assumption 1 The function / in (JT| is convex and uniformly Lipschitz con¬ 
tinuously differentiable, and attains its minimum value f* on a set S. There 
is a finite Rq such that the level set for f defined by x° is bounded, that is, 

max maxjllx — x*|| : f(x) < f(x 0 )} < R 0 

x*eS x 


( 26 ) 
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Algorithm 3 Randomized Coordinate Descent for (jTJ) 

Choose x° £ K n ; 

Set k <— 0; 

repeat 

Choose index i^ with uniform probability from {1,2,..., n}, independently of choices 
at prior iterations; 

Set x k+1 x k — a/c[V f(x k )\i k ei k for some oik > 0; 
k «— k + 1; 

until termination test satisfied; 


3.3 Randomized Algorithms 


In randomized CD algorithms, the update component i k is chosen randomly 
at each iteration. In Algorithm [3] we consider the simplest variant in which 
each i k is selected from {1,2 ,n} with equal probability, independently of 
the selections made at previous iterations. (We can think of this scheme as 
“sampling with replacement” from the set {1,2,... ,n}.) 

We denote expectation with respect to a single random index ik by Ei k (-), 
while E(-) denotes expectation with respect to all random variables ig, i\, ii,.... 

We prove a convergence result for the randomized algorithm, for the sim¬ 
ple steplength choice otk = l/£ m ax- (The proof is a simplified version of the 
analysis in Nesterov [371 Section 2]. A result similar to (27) is proved by 
Shalev-Schwartz and Tewari m for certain types of ^-regularized problems.) 


Theorem 1 Suppose that Assumption [7] holds. Suppose that ak = 1/L max in 
Algorithm [3J Then for all k > 0 we have 


E{f{x k )) — f* < 2nLmaxR o . 


(27) 


When o > 0 in (20), we have in addition that 


e {f(x k )) " (/(x°) - r). 


(28) 


Proof By application of Taylor’s theorem, and using (21) and (22), we have 
f(x k+1 ) = f {x k - a k [Vf(x k )\ ik e ik ) 


1 


< /(**) - a k [Vf{x k )\i + ^a 2 k L ik [Vf(x k )\i k 

Tmax 
~2 


<f( x k )~a k [l-^a k )[Vf(x k )}l 


= f{x k )-7^Vf(.x k )}l, 

Z 1j in nv 


(29) 
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where we substituted the choice a k = 1/L m ax in the last equality. Taking the 
expectation of both sides of this expression over the random index i k , we have 

1 1 m 

EiJ(x k+1 ) < f(x k ) - —--£[V/(:r fc )] 2 

z— 1 

= /(^')- 9 - / ^l|V/(x fc )|| 2 . (30) 

(We used here the facts that does not depend on i k , and that i k was chosen 
from among {1,2 ,,n} with equal probability.) We now subtract f(x*) from 
both sides this expression, take expectation of both sides with respect to all 
random variables io, ii, ■ ■ ., and use the notation 

<f> k :=E{f{x k ))- /*. (31) 


to obtain 

fa+i <fa- (||V/(* fc )|| 2 ) < <t> k - [E(\\Vf(x k )\\)] 2 . (32) 

Z71i^max Z77/i^ m ax 

(We used Jensen’s Inequality in the second inequality.) By convexity of / we 
have for any x* £ S that 

f(x k ) - f* < Vf(x k ) T (x k - x *) < ||V/(x fe )||||a; fe - **|| < R 0 \\X7f(x k )\\, 


where the final inequality is because f{x k ) < f(x°), so that x k is in the level 
set in (26). By taking expectations of both sides, we obtain 

E(\\\7f(x k )\\) > ^ k . 

Kq 


When we substitute this bound into (32), and rearrange, we obtain 

1 1 


<t>k — <t>k +1 > 


On T d 2 ^k ’ 
^a-L/ max rip 


We thus have 


1 1 _ (pk — (j>k+ 1 > (f'k — <t>k +1 > 1 

4*k+l (frk 4 > k4 ) k +1 4*k q 

By applying this formula recursively, we obtain 

11 k k 

— >-1-o > - 

<t>k <t>o 2 nL max-Eg 2?T.Z/ max J?Q 


so that (27) holds, as claimed. 

In the case of / strongly convex with modulus cr > 0, we have by taking 
the minimum of both sides with respect to y in (20), and setting x = x k , that 
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By using this expression to bound ||V/(ar)|| 2 in (32), we obtain 


4>k+1 < — 


nL„ 


~(l>k = 1 - 


nL n 


Recursive application of this formula leads to (28). 


Note that the same convergence expressions can be obtained for more re¬ 
fined choices of steplength a*,, by making minor adjustments to the logic in 


(291. For example, the choice a*, = 1/ L ik leads to the same bounds (271 and 


(281. The same bounds hold too when au is the exact minimizer of / along 


the coordinate search direction; we modify the logic in (29) for this case by 
taking the minimum of all expressions with respect to ctk , and use the fact 
that ctk = 1 /T max is in general a suboptimal choice. 


We can compare (27) with the corresponding result for full-gradient descent 


with constant steplength ctk = 1/L (where L is from (23)). The iteration 


x k+i =x k_ 1 V/(x fe ) 

1J 

leads to a convergence expression 

f(x k )-r< 2 L f° 

(see, for example, [55]). Since, as we have noted, L can be as large as nL B 


(33) 


the bound in this expression may be equivalent to (27) in extreme cases. More 


typically, these two Lipschitz constants are comparable in size, and the ap¬ 


pearance of the additional factor n in (27) indicates that we pay a price in 
terms of slower convergence for using only one component of V/(a; fe ), rather 
than the full vector. 

Expected linear convergence rates have been proved under assumptions 
weaker than strong convexity; see for example the “essential strong convexity” 
property of [28], the “optimal strong convexity” property of [22], the “gener¬ 
alized error bound” property of [51] . and [55] Assumption 2], which concerns 
linear growth in a measure of the gradient with distance from the solution set. 

A variant on Algorithm [3] uses “sampling without replacement.” Here the 
computation proceeds in “epochs” of n consecutive iterations. At the start of 
each epoch, the set {1,2,..., rz} is shuffled. The iterations then proceed by 
setting ik to each entry in turn from the ordered set. This kind of randomiza¬ 
tion has been shown in several contexts to be superior to the sampling-with- 
replacement scheme analyzed above, but a theoretical understanding of this 
phenomenon remains elusive. 


Randomized Kaczmarz Algorithm. It is worth proving an expected linear con¬ 


vergence result for the Kaczmarz iteration (161 for linear equations Aw = b 


as a separate, more elementary analysis. In one sense, the result is a special 
case of Theorem [l] since, as we showed above, the iteration (16) is obtained by 
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applying Algorithm [3] to the dual formulation (10). In another sense, the re¬ 
sult is stronger, since we obtain a linear rate of convergence without requiring 
strong convexity of the objective (10), that is, the system Aw = b is allowed 
to have multiple solutions. 

We denote by A m i njnz the minimum nonzero eigenvalue of AA T and let P(-) 
denote projection onto the solution set of Aw = b. We have 


lk fc+1 ^ P(w k+1 )\\ 2 < ||w fe - Al(A. lk w k - b lk ) - P(w k )|| 2 
= l\\w k -P(w k )\\ 2 -(A ik w k -b ik ) 2 , 

where we have used normalization of the rows |l4| ) and the fact that Ai k P(x k ) = 
bi k . By taking expectations of both sides with respect to ik , we have 

E lk \\w k+1 - P{w k+l )f < || W fc - P(w> fc )|| 2 - E lk (A ik w k - b ik ) 2 

= \\\w k ~ P(w k )f - -\\Aw k -bf 
2 m 

< \\w k -P(w k )W 2 . 

By taking expectations of both sides with respect to all random variables 
io, h ,..., and proceeding recursively, we obtain 

\ . \ k 

^mirijnz \ II 0 r>( 0\112 

1 - ! — \\w — F{w )\\ . 

m ) 


E\\w k -P{i 


< 


(This analysis is slightly generalized from Strohmer and Vershynin [53| to allow 
for nonunique solutions of Aw = b ; see also [21].) 


3.4 Accelerated Randomized Algorithms 


The accelerated randomized algorithm, specified here as Algorithm [4j was 
proposed by Nesterov EZj. It assumes that an estimate is available of modulus 
of strong convexity a > 0 from as well as estimates of the component-wise 
Lipschitz constants Li from (21). (The algorithm remains valid if we simply 
use Tmax in place of Li k for all k.) 

The approach is a close relative of the accelerated (full-)gradient methods 
that have become extremely popular in recent years. These methods have 
their origin in a 1983 paper of Nesterov [3S] and owe much of their recent 
popularity to a recent incarnation known as FISTA |2j and an exposition in 
Nesterov’s 2004 monograph [36], as well as ease of implementation and good 
practical experience. In their use of momentum in the choice of step - the 
search direction combines new gradient information with the previous search 
direction — these methods are also related to such other classical techniques 
as the heavy-ball method (see [33]) and conjugate gradient methods. 

Nesterov m Theorem 6] proves the following convergence result for Algo¬ 
rithm 131 
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Algorithm 4 Accelerated Randomized Coordinate Descent for (jTJ) 

Choose x° £ R n ; 

Set k +- 0, v° <— x°, 7 _i <— 0; 

repeat 

Choose 7 j. to be the larger root of 


2 7 k (, 7 k<?\ 2 

7fe-= ( 1-) 7 fc -i- 

n \ n / 


Set 




rc ~ 7fc CT 
7 fe (n 2 - o-) ’ 




7fc°~ . 

n 


(34) 


Set y k +- a k v k + (1 — 

Choose index £ (1, 2,. .., n} with uniform probability and set d k = f(y k )]i k £i k ; 
Set x^ 1 <- y k - {1/Li k )d k -, 

Set v k+1 e- 0 k v k + (1 - /3 k )y k - (■y k /L ik )d k ; 
h i — k 1; 

until termination test satisfied; 


Theorem 2 Suppose that Assumption^ holds, and define 


S 0 := sup imaxll® -a;*|| + (/O )~f*)/n ■ 

x*£S 


Then for all k > 0 we have 

E{/{x k )) - r 

\j a / L n 


< S 0 - 


<S 0 


k+l 


1 + 


2 n 


- 1 - 


2n 


k+l 


k + l 


(35) 

(36) 


In the strongly convex case cr > 0, the term (1 + i/cr/L max /(2n)) ?£+1 
eventually dominates the second term in brackets in (35), so that the lin¬ 


ear convergence rate suggested by this expression is significantly faster than 
the corresponding rate (28) for Algorithm [3j Essentially, the measure a/L mSLK 
of conditioning in (28) is replaced by its square root in (35), suggesting a de¬ 
crease by a factor of \JL max /a in the number of iterations required to meet a 
specified error tolerance. In the sublinear rate bound (36), which holds even for 
weakly convex /, the 1 fk bound of 


is replaced by a 1 /k 2 factor, implying 
a reduction from 0(l/e) to 0(1/y/e) in the number of iterations required to 
meet a specified error tolerance. 


3.5 Efficient Implementation of the Accelerated Algorithm 

One fact detracts from the appeal of accelerated CD methods over standard 
methods: the higher cost of each iteration of Algorithm |4) Both standard and 
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Algorithm 5 Accelerated Randomized Kaczmarz for (8), (14) 


Choose w° £ K n ; 

Set k <— 0, <— w°, 7 —i <r- 0; 

repeat 

Choose 7 j. to be the larger root of 


2 7 k (, 7 k<r\ 2 

7 k -= ( 1 -) 7fc-i- 

n \ n / 


Set 


•<— 


7 fe (n 2 - o-) ’ 




7fc°~ . 

n 


Set y k -f- + (1 - a k )w k \ 

Choose index ik E {1,2, with uniform probability and set 

Set <- y k - d k ; 

Set v k+1 4 - /3 fe C fe + (1 - P k )y k - "/kd k ; 
k 4— fc + 1; 

until termination test satisfied; 




(37) 




fc 


accelerated variants require calculation of one element of the gradient, but 
Algorithm [ 3 ] requires an update of just a single component of x, whereas Al¬ 
gorithm [4] also requires manipulation of the generally dense vectors y and v. 
Moreover, the gradient is evaluated at x k in Algorithm [3J where the argument 
changes by only one component from the prior iteration, a fact that can be 
exploited in several contexts. In Algorithm [4J the argument y k for the gradi¬ 
ent changes more extensively from one iteration to the next, making it less 
obvious whether such economies are available. However, by using a change of 
variables due to Lee and Sidford [S3], it is possible to implement the acceler¬ 
ated randomized CD approach efficiently for problems with certain structure, 
including the linear system Aw = b and certain problems of the form ([5|. 

We explain the Lee-Sidford technique in the context of the Kaczmarz algo¬ 
rithm for ([8]) , assuming normalization of the rows of A (14). As we explained in 
(16), the Kaczmarz algorithm is obtained by applying CD to the dual formu¬ 
lation ( |10[ ) with variables x, but operating in the space of “primal” variables w 
using the transformation w = A T x. If we apply the transformations v k = A T v k 
and y k = A T y k to the other vectors in Algorithm [4J and use the fact of nor- 
malization (p~4] ) (and hence (AA T )u = 1 for all* = 1,2,..., m) to note that 
Li = 1 in (21), we obtain Algorithm [5J 

When the matrix A is dense, there is only a small factor of difference 
between the per-iteration workload of the standard Kaczmarz algorithm and 
its accelerated variant, Algorithm [5] Both require 0(m + n) operations per 
iteration. However, when A is sparse, the computational difference between the 
two algorithms becomes substantial. At iteration k 7 the standard Kaczmarz 
algorithm requires computation proportion to a small multiple of the number 
of nonzeros in row Ai k (which we denote by |Aj fc |). Meanwhile, iteration k 
of Algorithm [H] requires manipulation of the dense vectors v k and y k — both 
0(n) processes — and the benefits of sparsity are lost. This apparent defect was 
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partly remedied in [29] by “caching” the updates to these vectors, resulting in 
a number of cycles within which updates gradually “fill in.” The more effective 
approach of [23] performs a change of variables from v k and y k to two other 
vectors v k and y k that can be updated in 0(\A ik |) operations. To describe this 
representation, we start by noting that if we substitute for w k and w k+1 in the 
formulas of Algorithm [ 5 J we obtain the updates to v k and y k in the following 
form: 

[v k+1 y k+ i] = [v k y k ] R k - S k , (38) 


where 


Rk '■= 
S k := 


Pk Ctk+lPk 
(1 — Pk) (1 — otk+i(3k)_ 

( A iJ k - K) A Z [lk (1 - 


Oik+l + ttfc+l7fc)] ■ 


Note that Rk is a 2 x 2 matrix while Sk is an n x 2 matrix with nonzeros 
only in those rows for which Af k has a nonzero entry. We define a change of 
variables based on another 2x2 matrix Bk , as follows: 


[v k y k ] = [v k y k ] B k , 


(39) 


where we initialize with Bq = I. By substituting this representation into (381, 
we obtain 

[v k+ 1 y k+1 ] B k+ 1 = [v k y k ] B k R k - s k , 


so we can maintain validity of the representation (39) at iteration k + 1 by 
setting 

B k+1 := B k R k , [v k+1 y k+1 ] := [v k y k ] - S k B^ v (40) 


The computations in (40) can be performed in 0(\Ai k |) operations, and can re¬ 
place the relatively expensive computations of y k and v k+1 in Algorithm [ 5 ] The 
only other operation of note in this algorithm — computation of Ai k y k — bi k — 
can also be performed in 0(\A ik |) operations using the ( v k , y k ) representation, 
by noting from (39) that 


= (A ik v k )(B k ) i 2 + (A lk y k ){B k ) 22 . 


This efficient implementation can be extended to the dual empirical risk 
minimization problem ([7]) for certain choices of regularization function <?(•), for 
example, g(z) = ||z|| 2 /2; see [25]. As pointed out in [23], the key requirement 
for the efficient scheme is that the gradient term [V f(y k )\i k can be evaluated 
efficiently after an update to the two vectors in the alternative representation 
of y k , and to the two coefficients in this representation. Another variant of this 
implementation technique appears in [16J Section 5]. 
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3.6 Cyclic Variants 


We have the following result from |3] for the cyclic variant of Algorithm [lj 

Theorem 3 Suppose that Assumption^ holds. Suppose that ak = 1/L max 
Algorith m [7} with the index ik at iteration k chosen according to the eye 
ordering (fTlJ) (with io = 1). Then for k = n, 2 n, 3 n ,..., we have 

4nL max (l + nL /A max )i?Q 


/(* fc ) -r< 


k + 8 


(41) 


When a > 0 in the strong convexity condition (20), we have in addition for 
k = n, 2 n, 3 n ,... that 

k/n 


f(x«) - r < i - 


2Amax(l + rl ^ 2 /^max) 


(ffri-n- ( 42 ) 


Proof The result (41) follows from Theorems 3.6 and 3.9 in [3] when we note 
that (i) each iteration of Algorithm BCGD in [3] corresponds to a “cycle” of 
n iterations in Algorithm [lj (ii) we update coordinates rather than blocks, so 
that the parameter p in [3]Ts equal to n; (iii) we set L max and L m i n in [3] both 
to L max . 

Comparing the complexity bounds for the cyclic variant with the corre¬ 
sponding bounds proved in Theorem[l]for the randomized variant, we see that 
since L > L max in general, the numerator in (41) is 0(n 2 ), in contrast to 


0(n ) term in (27). A similar factor of n in seen in comparing (28) to (421, 
when we note that (1 — e) 1,/ra « 1 — e/n for small values of e. The bounds in 
Theorem [3] are deterministic, however, rather than being bounds on expected 
nonoptimality, as in Theor em [Tj 

We noted in Subsection 3.2 that the ratio L/L max lies in the interval [1, n] 
when / is a convex quadratic function and both parameters are set to their 
best values. Lower values of this ratio are attained on functions that are “more 
decoupled” and larger values attained when there is a greater dependence 
between the coordinates. Larger values lead to weaker bounds in Theorem [3j 
which accords with our intuition; we expect CD methods to require more 
iterations to resolve the coupling of the coordinates. 

We are free to make other, larger choices of L max ; they need only satisfy 
the conditions (21) and (22). Larger values of L max lead to shorter steps ak = 
1 /L n 


bound in (41) becomes 


and different complexity expressions. For L n 

4 n(n + 1 )LRq 
k + 8 ’ 


= L , for example, the 


which is worse by a factor of approximately 2 n 2 than the bound (33) for the 
full-step gradient descent approach. For L max = y/nL, we obtain 

8 n 3 ' 2 LR 2 
k + 8 ’ 

which still trails (33) by a factor of 4?r 3 / 2 . 
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3.7 Extension to Separable Regularized Case 


In this section we consider the separable regularized formulation where 

/ is smooth and strongly convex, and each J? i = 1,2,... ,n is convex. We 
prove a result similar to the second part of Theorem[I]for a randomized version 
of Algorithm [2j The proof is a simplified version of the analysis from [48] . It 
makes use of the following assumption. 

Assumption 2 The function f in ^ is uniformly Lipschitz continuously 
differentiable and strongly convex with modulus a > 0 (see The functions 

fli, i = 1,2 ,,n are convex. The function h in § attains its minimum value 
h* at a unique point x*. 


Our result uses the coordinate Lipschitz constant L max for /, as defined 
in (22). Note that the modulus of convexity a for / is also the modulus of 
convexity for h. By elementary results for convex functions, we have 


h(ax + (1 — a)y) < ah(x) + (1 — a)h(y) — -cra(l — a)||x — y\\ 2 . (43) 

Theorem 4 Suppose that Assumption [£] holds. Suppose that the indices ik in 
Algorithm [1] are chosen independently for each k with uniform probability from 
{1, 2,..., n}, and that a /. = 1/L max . Then for all k > 0, we have 


E ( h{x k )) - h* < 



a 


nL 


max 


(h(x°) — h*). 


(44) 


Proof Define the function 


H(x k ,z) := /(aT) + V/(a; fe ) i (z - aA) + -L mSiX \\z - x 


k ||2 


Xf2(z), 


and note that this function is separable in the components of z, and attains its 
minimum over z at the vector z k whose ik component is defined in Algorithm]^ 
Note by strong convexity (20) that 


H{x k , z) < f(z) - * a\\z - a; fc || 2 + ^L m ^\\z - x k \\ 2 + A Q(z) 

= h ( z ) + ^max - a)\\z- X k \\ 2 . (45) 

We have by minimizing both sides over 2 in this expression that 
H(x k ,z k ) = min H(x k ,z) 

Z 


< min h{z) + ^(L max - cr)\\z - x k \\ 2 

Z Z 

< min h{ax* + (1 — a)x k ) + -(L max — cr)a 2 ||a: fe — x*|| 2 

aG[0,l] 2 

< min ah* + (1 — a)h(x k ) + - [(L max — a)a 2 — aa(l — a)l ||a; fe — a;*|| 2 

aG[0,l] 2 

<j^-h*+(l--^-)h{x k ), (46) 

I'max \ I'max / 
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where we used (45) for the first inequality, (431 for the third inequality, and 
the particular value a = o /L max for the fourth inequality (for which value the 
coefficient of \\x k — cc*|| 2 vanishes). Taking the expected value of h(x k+1 ) over 
the index ih, we have 


Ei h h(* k+1 ) =lib 


2=1 


f{x k + {z k - x k )e t ) + \ fli(z k ) + A ^ C2 3 {x k ) 




I f 1 

< - E /(**) + [V/(* fc )]i(^ - x k ) + -L max (z k - x k f 

II 2=1 ^ 


+Xfi i {z k ) + Xj2^j( x j) 


n — 1 h{x k ) + - [f(x k ) + X7f(x k ) T (z k - x k ) 


+ ^L ma 4z k - x k f 


Xf2(z k ) 


- — -h(x k ) + -H(x k ,z k ). 
n n 


By subtracting h* from both sides of this expression, and using (46) to sub¬ 
stitute for H(x k 1 z k ), we obtain 


E lk h{x k+1 ) - h* <|1- 


nL„ 


{h(x k ) — h*). 


By taking expectations of both sides of this expression with respect to the 
random indices *o, H,* 2 , • • ■, ik-i, we obtain 


E{h{x k+1 )) - h* < ( 1 - 


(E(h(x k ))-h*). 


^fmax / 

The result follows from a recursive application of this formula. 


A result similar to (27) can be proved for the case in which / is convex but 
not strongly convex, but there are a few technical complications, and we refer 
the reader to [48] for details. 

An extension of the fixed-step approach to separable composite objectives 
with nonconvex smooth part / is discussed in [41] , where it is shown 
that accumulation points of the sequence of iterates are stationary and that a 
measure of optimality decreases to zero at a sublinear (1/fc) rate. 


3.8 Computational Notes 

A full computational comparison between variants of CD (and between CD 
and other methods) is beyond the scope of this paper. Nevertheless it is worth 
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asking whether various aspects of the convergence analysis presented above 
— in particular, the distinction between CD variants — can be observed in 
practice. To this end, we used these methods to minimize a convex quadratic 
f(x) = (1/2 )x T Qx (with Q symmetric and positive semidefinite) for which 
x* = 0 and f* = 0. We constructed Q by choosing an integer r from 1, 2,..., n 
and parameters r] G [0,1] and C > 0, and defining 

Q-Vr^E V r T v + (ll T , (47a) 

V r ,r, :=vV+(l-v)E r , (47b) 

E r := [I r xr | 0 rx (n— r)] ■ (47c) 


where V G R nxr is a random matrix with r < n orthogonal columns, E 
is an r x r positive diagonal matrix whose diagonal elements were chosen 
from a log-uniform distribution to have a specified condition number (with 
maximum diagonal of 1), and 1 is the vector (1,1,...,1) T . For convenience, 
we normalized Q so that its maximum diagonal — and thus L max (22) - is 

1 . 

By choosing 77 and £ appropriately, we can obtain a range of values for 
the quantities described in Subsection |3.2| which enter along with the smallest 
singular value into the convergence expression. For example, by setting £ = 0 
and 77 = 0 we obtain a randomly oriented matrix, possibly singular, with 
a specified range of nonzero eigenvalues. Nonzero values of 77 and £ induce 
different types of orientation bias. In particular, we see that yl (251 increases 
toward its upper bound of \fn as £ increases away from zero. 

We tested three CD variants. 


— CYCLIC: Cyclic CD, described in Subsection |3.6| 

— IID: Randomized CD using sampling with replacement: Algorithm [3] 

— EPOCHS: The “sampling without replacement” variant of Algorithm [3j 

described following the proof of Theorem [Tj 

For each variant, we tried both a fixed steplength ak = l/Lmax and the 
optimal steplength ctk = 1 /Qi k ,i k - Thus, there were a total of six algorithmic 
variants tested. 

The starting point x° was chosen randomly, with all components from the 
unit normal distribution TV(0,1). The algorithms were terminated when the 
objective was reduced by a factor of 10 -6 over its initial value f{x°). 

The speed of convergence varied widely according to the problem construc¬ 
tion parameters 77 , A, and cond(A), but we can make some general observa¬ 
tions. First, on problems that are not well conditioned, the function values 
f(x k ) decreased rapidly at first, then settled into a linear rate of decrease. 
This linear rate held even for problems in which Q was singular — a signif¬ 
icant improvement over the sublinear rates predicted by the theory. Second, 
the EPOCHS variant of randomized CD tended to converge faster than the 
IID version, though rarely more than twice as fast. Third, the use of the op¬ 
timal step was usually better than the fixed step (with sometimes up to six 
times fewer iterations), but this was by no means always the case. Fourth, 
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while there were extensive regimes of parameter values in which all six vari¬ 
ants performed similarly, there were numerous “stressed” settings in which the 
CYCLIC variants are much slower than the randomized variants, by factors 
of 10 or more. 


4 Parallel CD Algorithms 

CD methods lend themselves to different kinds of parallel implementation. 
Even basic algorithm frameworks such as Algorithm [Tj may be amenable to 
application-specific parallelism, when the computations involved in evaluating 
a single element of the gradient vector are substantial enough to be spread out 
across cores of a multicore computer. We concern ourselves here with more 
generic forms of parallelism, which involve multiple instances of the basic CD 
algorithm, running in parallel on multiple processors. 

We can distinguish different types of parallel CD algorithms. Synchronous 
algorithms are those that partition the computation into pieces that can be 
executed in parallel on multiple processors (or cores of a multicore machine), 
but that synchronize frequently across all processors, to ensure consistency 
of the information available to all processors at certain points in time. For 
example, each processor could update a subset of components of x in parallel 
(with the subsets being disjoint), and the synchronization step could ensure 
that the results of all updates are shared across all processors before further 
computation occurs. The synchronization step often detracts from the perfor¬ 
mance of algorithms, not only because some processors may be forced to idle 
while others complete their work, but also because the overheads associated 
with (hardware and software) locking of memory accesses can be high. Thus, 
asynchronous methods, which weaken or eliminate the requirement of consis¬ 
tent information across processors, are preferred in practice. Analysis of such 
methods is more difficult, but results have been obtained that accord with 
practical experience of such methods. Indeed, it can be verified that in certain 
regimes, linear speedup can be expected across a modest number of processors. 


4.1 Synchronous Parallelism 

We mention several synchronous parallel variants of CD that appear in the 
recent literature. We note that in the some of these papers, the computa¬ 
tional results were obtained by implementing the methods in an asynchronous 
fashion, disregarding the synchronization step required by the analysis. 

Bradley at al. [5] consider a bound-constrained problem that is a reformu¬ 
lation of the problem © with specific choices of / and with ft(x) = ||a-i|i- 
Their algorithm performs short-step updates of individual components of x 
in parallel on P processors, with synchronization after each round of parallel 
updating. This scheme essentially updates a randomly-chosen block of P vari¬ 
ables at each cycle. By modifying the analysis of EH, they show that the 1 /k 
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sublinear convergence rate bound is not affected provided that P is no larger 
than n/L, where L is the Lipschitz constant from (23). 

Jaggi et al. (201 perform a synchronized CD method on the dual ERM 
model 0 for the case of g{w ) = g*(w) = (l/2)||w;|| 2 , partitioning components 
of the dual variable x between cores and sharing a copy of the vector Ax across 
cores, updating this vector at each synchronization point. The approach can 
be thought of as a nonlinear block Gauss-Jacobi method (by contrast with the 
coordinate Gauss-Seidel approaches discussed in Section [3]). 

Richtarik and Takac }47; describe a method for the separably regularized 
formulation §, © in which a subset of indices Sk C {1,2,..., rz.} is updated 
according to the formula in Algorithm [2] The work of updating the compo¬ 
nents in Sk is divided between processors; essentially, a synchronization step 
takes place at each iteration. This scheme is enhanced with an acceleration 
step in |15j : the extra computations associated with the acceleration step too 
are parallelized, using ideas from [23]. In the scheme of Marecek, Richtarik, 
and Takac 321 . the variable vector x is partitioned into subvectors, and each 
processor is assigned the responsibility for updating one of these subvectors. 
On each processor, the updating scheme described in m is applied, provid¬ 
ing a second level of parallelism. Synchronization takes place at each outer 
iteration. Details of the information-sharing between processors required for 
accurate computation of gradients in different applications are described in 
132, Section 6]. 


4.2 Asynchronous Parallelism 

In asynchronous variants of CD, the variable vector x is assumed to be acces¬ 
sible to each processor, available for reading and updating. (For example, x 
could be stored in the shared-memory space of a multicore computer, where 
each core is viewed as a processor.) Each processor runs its own CD process, 
shown here as Algorithm [6] without any attempt to coordinate or synchronize 
with other processors. Each iteration on each processor chooses an index i. 
loads the components of x that are needed to compute the gradient compo¬ 
nent [V/(a;)]j, then updates the zth component x, t . Note that this evaluation 
may need only a small subset of the components of x; this is the case when 
the Hessian V 2 / is structurally sparse, for example. On some multicore archi¬ 
tectures (for example, the Intel Xeon), the update of Xi can be performed as a 
unitary operation; no software or hardware locking is required to block access 
of other cores to the location Xi. 


Algorithm 6 Coordinate Descent for (|T|) (running on each Processor) 
repeat 

Choose index i G {1, 2,..., n}; 

Evaluate [V/(cc)]i, reading components of x from shared memory as necessary; 
Update Xi <— — a[V f(x)]i for some a > 0; 

until termination; 
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We can take a global view of the entire parallel process, consisting of mul¬ 
tiple processors each executing Algorithm |6j by defining a global counter k 
that is incremented whenever any processor updates an element of x : see Al¬ 
gorithm [7] Note that the only difference with the basic framework of Algo¬ 
rithm [l] is in the argument of the gradient component: In Algorithm [l] this is 
the latest iterate x k whereas in Algorithm [ 7 ] it is a vector x k that is generally 
made up of components of vectors from previous iterations a : J , j = 0,1,..., k. 
The reason for this discrepancy is that between the time at which a processor 
reads the vector x from shared storage in order to calculate [V f(x)]i, and the 
time at which it updates component i, other processors have generally made 
changes to x. In consequence, each update step is using slightly stale informa¬ 
tion about x. To prove convergence results, we need to make assumptions on 
how much “staleness” can be tolerated, and to modify the convergence anal¬ 
ysis quite substantially. Indeed, proofs of convergence even for the most basic 
asynchronous algorithms are quite technical. 


Algorithm 7 Asynchronous Coordinate Descent for (|T|) 

Set k <— 0 and choose x° E M n ; 

repeat 

Choose index i j, E (1,2,..., n}; 

x k + 1 <— x k — Ofc[V f{x k )\i k ei k for some a t, > 0; 

k i — k "f* 1; 

until termination test satisfied; 


Asynchronous CD algorithms are distinguished from each other mostly by 
the assumptions they make on the the choice of update components ik and 
on the “ages” of the components of x k , that is, the iterations at which each 
component of this vector was last updated. In the terminology of Bertsekas 
and Tsitsiklis A). the algorithm is totally asynchronous if 

(a) each index i £ {1,2,...,«.} of 2 ; is updated at infinitely many iterations; 
and 

(b) if v k denotes the iteration at which component j of the vector x k was last 
updated, then v k —> 00 as k —> 00 for all j = 1,2 ,..., n. 

In other words, each component of x is updated infinitely often, and all com¬ 
ponents used in successive evaluation vectors x k are also updated infinitely 
often. 

The following convergence result for totally asynchronous variants of Al¬ 
gorithm [?] is due to Bertsekas and Tsitsiklis; see in particular [5J Sections 6.1, 
6.2, and 6.3.3]. 

Theorem 5 Suppose that the problem ([I]) has a unique solution x* and that 
f is convex and continuously differentiable. Suppose that Algorithm^is imple¬ 
mented in a totally asynchronous fashion. Suppose that the mapping T defined 
by T(x) := i-aV/(i) for some a > 0 (for which x* is the unique fixed point) 
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is strictly contractive in the norm, that is, 


||T(x) - ^lloo < r/\\x - ^Hoo, for some rj £ {0,1). (48) 

Then if we set ak = a in Algorithm [?| the sequence {x fc } converges to x*. 

We cannot expect to obtain a convergence rate in this setting (such as sublinear 
with rate 1 /k), given that the assumptions on the ages of the components in 
x k are so weak. Although this result can be generalized impressively and its 
proof is not too complex, we should note that the £oo contraction assumption 


(48) is quite strong. It is violated even by some strictly convex objectives /. 
For example, when f(x) = (1/2 )x T Qx with 


Q = 


11 

1 2 ’ 


we have / strictly convex with minimizer x* =0. However the mapping 
T{x) = {I — aQ)x is not contractive for any a > 0; we have for example 
that ||T'(a ;)|| 00 > ||x||oo when x = ( 1 , - 1 ) T . 

We turn now to pai'tly asynchronous variants of Algorithm [TJ in which 
we make stronger assumptions on the ages of the components of x \ Liu and 
Wright [ 27 \ consider a version of Algorithm [ 7 ] that is the parallel analog of 
Algorithm [3j in that each update component ik is chosen independently and 
randomly with equal probability from {1,2,..., n}. They assume that no com¬ 
ponent of x k is older than a nonnegative integer r - the “maximum delay” 
for any k. Specifically, they express the difference between x k and x k in 
terms of “missed updates” to x, as follows: 

x k =x k + J2 0 x l+1 ~x l ), (49) 

ieK(j) 

where K{j) is a set of iteration numbers drawn from the set {j — q : q = 
1, 2,..., t}. The value of r is related to the number of processors P involved 
in the computation. If all processors are performing their updates at approx¬ 
imately the same rates, we could expect r to be a modest multiple of P 
perhaps r = 2 P or r = 3 P, to allow a safety margin for occasional delays. 
Hence the value of r is an indicator of potential parallelism in the algorithm. 
In [27], the steplengths in Algorithm [ 7 ] are fixed as follows: 

Oik = (50) 

■^max 

where 7 is chosen to ensure that Algorithm[7]progresses steadily toward a solu¬ 
tion, but not too rapidly. Too-rapid convergence would cause the information 
in x k to become too stale too quickly, so the gradient component [V/(x fe )]j fc 
would lose its relevance as a suitable update for the variable component Xi k 
at iteration k. Steady convergence is enforced by choosing some p > 1 and 
requiring that 

E\\x k ~ x — x k \\ 2 < pE\\x k — x fe+1 || 2 , (51) 
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where x k is the vector that would hypothetically be obtained if we were to 
apply the the update to all components, that is, 


x k+x := x k - 


7 


-V/(s* 


and the expectations E{-) are taken over all random variables io, 12 , ■ ■ ■ ■ Con¬ 
dition (511 ensures that the “expected squared update norms” decrease by at 


most a factor of l/p at each iteration. 

The main results in m apply to composite function s (pj ), ([ 3 ]), but for sim¬ 
plicity here we state the result in terms of the problem ( 111 ), where / is convex 
and continuously differentiable, with nonempty solution set S and optimal 
objective value f*. We use Ps to denote projection onto S , and recall the 


definition (25) of the ratio A between different varieties of Lipschitz constants. 


The results also make use of an optimal strong convexity condition, which is 
that the following inequality holds for some a > 0: 


f(x)-r>-\\x-p s ( X ) f, 


for all x. 


(52) 


The following result is a modification of [23 Corollary 2]. 
Theorem 6 Suppose that Assumption^ holds, and that 

4eA(r + l ) 2 < \fn. 


(53) 


Then by setting 7 = 1/2 in ( 50 ) (that is, choosing steplengths ak = 1/(2L max )), 
we have that 

E (/(**)) - r < ^-ll- 0 -^^°)ll 2 + ^ 0 )-^) . (54) 


Assuming in addition that (52) is satisfied for some a > 0, we obtain the 
following linear rate: 

E(f{x k ))-r 


< 1- 


n((T T 2T max ) 


(£max||* U - PsOOU 2 + f{x U ) - /*)• (55) 


A comparison with Theorem [T[ which shows convergence rates for serial ran¬ 
domized CD (Algorithm [ 3 ]) shows a striking similarity in convergence bounds. 
The factor-of-2 difference in steplength between the serial and parallel vari¬ 
ants accounts for most of the difference between the linear rates (l28l and (l55l , 


while there is an extra term n in the denominator of the sublinear rate (54). We 


conclude that we do not pay q high overhead (in terms of total workload) for 
parallel implementation, and hence that near-linear speedup can be expected. 
(Indeed, computational results in [27jj and |28| observe near-linear speedup for 
multicore asynchronous implementations.) 


These encouraging conclusions depend critically on the condition (531 


which is an upper bound on the allowable delay r in terms of n and the ratio 












Coordinate Descent Algorithms 


29 


A from (251. For functions / with weak coupling between the components of 
x (for example, when off-diagonals in the Hessian X7 2 f(x) are small relative to 
the diagonals), we have A not much greater than 1 , so the maximum delay can 
be of the order of tr 1 / 4 before there is any attenuation of linear speedup. When 
stronger coupling exists, the restriction on r may be quite tight, possibly not 
much greater than 1. A more general convergence result [27l Theorem 1] shows 
that in this case, we can choose smaller values of 7 in (50), allowing grace¬ 
ful degradation of the convergence bounds while still obtaining fairly efficient 
parallel implementations. 

We note that an earlier analysis in [28] made a stronger assumption on x k 

that it is equal to some earlier iterate x of Algorithm]?] where k > j > k—r , 
that is, the earlier iterate is no more than r cycles old. (A similar assumption 
was used to analyze convergence of as asynchronous stochastic gradient algo¬ 
rithm in |39|.) This stronger assumption yields stronger convergence results, 
in that the bound on r in (53) can be loosened. However, the assumption may 
not always hold, since some parts of x in memory may be altered by some 
cores as they are being read by another core, a phenomenon referred to in m 
as “inconsistent reading.” 


5 Conclusion 

We have surveyed the state of the art in convergence of coordinate descent 
methods, with a focus on the most elementary settings and the most funda¬ 
mental algorithms. The recent literature contains many extensions, enhance¬ 
ments, and elaborations; we refer interested readers to the bibliography of this 
paper, and note that new works are appearing at a rapid pace. 

Coordinate descent method have become an important tool in the opti¬ 
mization toolbox that is used to solve problems that arise in machine learning 
and data analysis, particularly in “big data” settings. We expect to see fur¬ 
ther developments and extensions, further customization of the approach to 
specific problem structures, further adaptation to various computer platforms, 
and novel combinations with other optimization tools to produce effective “so¬ 
lutions” for key application areas. 
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